Tobacco use status from clinical notes using Natural Language Processing and rule based algorithm

Abstract

BACKGROUND:

This cross-sectional retrospective study utilized Natural Language Processing (NLP) to extract tobacco-use associated variables from clinical notes documented in the Electronic Health Record (EHR).

OBJECITVE:

To develop a rule-based algorithm for determining the present status of the patient’s tobacco-use.

METHODS:

Clinical notes ( $n=$ 5,371 documents) from 363 patients were mined and classified by NLP software into four classes namely: “Current Smoker”, “Past Smoker”, “Nonsmoker” and “Unknown”. Two coders manually classified these documents into above mentioned classes (document-level gold standard classification (DLGSC)). A tobacco-use status was derived per patient (patient-level gold standard classification (PLGSC)), based on individual documents’ status by the same two coders. The DLGSC and PLGSC were compared to the results derived from NLP and rule-based algorithm, respectively.

RESULTS:

The initial Cohen’s kappa ( $n=$ 1,000 documents) was 0.9448 (95% CI $=$ 0.9281–0.9615), indicating a strong agreement between the two raters. Subsequently, for 371 documents the Cohen’s kappa was 0.9889 (95% CI $=$ 0.979–1.000). The F-measures for the document-level classification for the four classes were 0.700, 0.753, 0.839 and 0.988 while the patient-level classifications were 0.580, 0.771, 0.730 and 0.933 respectively.

CONCLUSIONS:

NLP and the rule-based algorithm exhibited utility for deriving the present tobacco-use status of patients. Current strategies are targeting further improvement in precision to enhance translational value of the tool.

Keywords

Data mining decision support systems clinical health information systems smoking electronic health records information storage and retrieval

1. Introduction

Despite the fact that tobacco use is the most common preventable cause of death, every year about 480,000 people in the United States (U.S) and about 6 million people globally die from tobacco-related illnesses [1, 2]. Studies have shown that tobacco consumption is the most putative etiological factor in the development of cancers in the oral cavity, lungs and breast and represents a high risk to patients globally [3, 4, 5, 6, 7].

The National Ambulatory Medical Care Survey (NAMCS) defines “tobacco use” as documentation of latest status of tobacco use in the patient’s medical record [8]. Clear and compelling evidence found by the U.S. Preventive Services Task Force (USPSTF) showed that intervention by healthcare providers utilizing strategies such as interview, counselling and documentation of patient’s tobacco use status were effective for tobacco use cessation, prevention and oral cancer surveillance [9, 10].

Historically, healthcare providers have recorded tobacco use status as free-text in clinical notes (corpora) to document patient tobacco use history [11, 12]. Presently, the documentation of tobacco use as a part of Meaningful Use (MU) criteria is entered in a structured format in the Electronic Health Records (EHRs) [13]. The structured data informs whether the patient is a current, former or a nonsmoker while the corpus captures details surrounding frequency, duration and cessation of tobacco use. At every subsequent patient visit, the EHR only shows the latest tobacco use status but not a collective status.

As an aspect of an oral cancer risk assessment research initiative (OCRARI), this study was proposed to develop a pipe-line to mine and include variables related to tobacco use from the EHR for predicting oral cancer risk [14]. The input variables for use in OCRARI required the collective status of the patient’s tobacco use. Marshfield Clinic, a large regional healthcare delivery system serving communities in western, central and northern regions of Wisconsin, implemented structured data collection of tobacco use in 2005 in its home-grown EHR. In addition to the structured data, details including tobacco use history, patient age at time of tobacco consumption and cessation were recorded in the corpora. Thus the structured data captured only some aspects of tobacco use (such as patient’s narrative of tobacco status at the time of visit) while the corpora provided a rich source of supplemental relevant information. This structured information was not available for all the patients, and hence to address this research question, an effort to electronically extract relevant information from the corpora to derive the present tobacco status by applying rule-based algorithm was undertaken in this study.

1.1 Related work

Natural Language Processing (NLP) has proven to be very useful in extracting information from the free-text documents and presenting it in a structured format [15, 16, 17, 18]. State-of-the-art NLP systems are efficient at identifying text in medical record documents surrounding symptomology, diagnoses and other aspects associated with health-related conditions [19, 20, 21, 22, 23]. However, limitations such as misinterpreted sentence boundaries, missing terms and word ambiguities have also been identified [24]. Compiling and synthesizing all relevant documents associated with a single patient into a cohesive ‘big picture’ is challenging even when attempted manually [25].

Historically, studies have applied rule-based classifiers to consolidate patient documents surrounding smoking status in order to stratify patients into one of the following four classes: ‘Current smoker’, ‘Past smoker’, ‘Nonsmoker’ and ‘Unknown’ [5, 26, 27]. Liu et al. showed that the patient-level classification could be accomplished by taking into account the frequency of occurrences of ‘Nonsmoker’, ‘Past Smoker’ and ‘Current Smoker’-classified documents in designating the overall classification of the patient [5]. Other studies incorporated temporal relationships by taking dates into consideration during patient-level classification [26]. For example, Nikfarjam et al. extracted dates from the corpora to create a ‘timeline’ of the patient’s medical history and classified the patient accordingly [26]. This was done by association of events to the admission and discharge times mentioned in the corpora.

1.2 Research objective

The objective of this retrospective study was to electronically extract relevant information from the corpora to deduce a collective tobacco use status of a patient from the corpus using rule-based algorithm. Moreover, the application of this study will extend to the identification of collective status of tobacco use based on the history of the patient’s tobacco consumption. Findings from this study may have inherent clinical utility based on establishment of capacity to correctly classify tobacco utilization status through data mining, thus enabling researchers and healthcare providers to identify patient cohorts, who may be eligible for tobacco use counselling and ultimately tobacco use cessation.

2. Materials and methods

The current study was conducted in two phases: first, to utilize NLP to extract tobacco use associated variables from clinical notes documented in the EHR and second, to develop an algorithm for determining the collective status of the patient’s tobacco use. This was contextually derived from the longitudinal history of tobacco usage recorded in corpora, irrespective of whether status was captured as structured data. The methodology presented herein describes a novel deductive strategy that defines collective status of patient’s tobacco use based on history captured in EHR clinical notes.

The study was reviewed and approved by the Marshfield Clinic Research Institute’s Institutional Review Board for research involving human subjects. A cross sectional retrospective study design was applied targeting extraction of data from Marshfield Clinic’s Enterprise Data Warehouse (MC-EDW) which houses clinically collected data surrounding the patient smoking status history among myriad of other patients’ health related data.

2.1 Data retrieval

Retrospective data was mined from MC-EDW canvasing the temporal frame of 36 years from 1979 through 2015. We randomly selected 400 patients from a pool of approximately 10 million patients and retrieved their corresponding clinical notes (5,371 documents) for analysis. For every clinical document, there existed a file ID, document ID and patient ID. File ID and document ID are reference identifiers within the electronic medical record documentation database. Patient ID is a number assigned to a patient for identification purposes.

Documents were further tagged with a ‘service date’, which corresponded to the date on which the document was created. The service date provided a temporal trajectory for each patient.

2.1.1 Phase I: Corpus-level classification

The NLP software used was cTakes (Clinical Text Analysis and Knowledge Extraction System), an open-source software hosted by the Apache Software Foundation. Each document was processed by the cTakes software [28]. The sentences in the documents were parsed and tokenized. We used cTakes smoking pipeline to partition and sort patients into one of the following four classes: ‘Current smoker’, ‘Past smoker’, ‘Nonsmoker’ and ‘Unknown’, informed by the historical data in their respective corpora. A 30-day abstinence threshold was defined as a prerequisite for assigning a ‘past smoker’ status where a history of tobacco use was recorded [1, 29, 30]. ‘Current smoker’ status was based on the definition of the National Survey on Drug Use and Health (NSDUH), and based on tobacco use habits reported by the patient over the preceding 30 days [1]. Feature extraction was done by cTakes using sentence boundary detector, tokenizer, part-of-speech tagger, shallow parser and named entity recognizer [28]. The part of speech tagger classified corpora that had features relevant to tobacco consumption in the present tense (e.g. smoking, tobacco chewing, smoking among others) as ‘Current Smoker’. Similarly, corpora which had features mentioned in the past tense (e.g. smoked, chewed etc.) or had a mention of ‘quit smoking’ were classified as ‘Past Smokers’. The ‘Nonsmoker’ status was assigned to corpora where there were specific mentions of the term ‘nonsmoker’ or ‘does not smoke’ or phrases in those lines. The ‘Unknown’ class was defined by the absence of any tobacco use information in a corpus. The output from cTakes was in the form of XML (Extensible Markup Language) documents corresponding to each corpus analyzed. A post-processor (Windows-based application developed in C#) was subsequently used to analyze XML documents (described further in the Post-processing section). The architecture of this process is shown in Fig. 1.

Figure 1.

Text analysis architecture.

Figure 2.

User Interface of the data abstraction tool displaying the annotated corpus.

2.1.1.1. Post-processing

The post-processor received the cTakes XML outputs as inputs and carried out further processing. The ‘File Id’ and ‘Document Id’ were captured from the XML files and mapped back to the ‘Patient Id’, thus establishing the association between the clinical note and the corresponding patient. The XML files also provided information about the document’s classification into the previously defined tobacco use classes: Current Smoker, Past Smoker, Nonsmoker and Unknown. After classification of all documents, the rows were arranged first by ‘Patient Id’ (ascending order) and then by ‘Service Date’ (descending order). This helped define the temporal trajectory for each patient.

2.1.1.2. Validation

All 5,371 documents were classified by two coders (NS and HH) using a data abstraction tool (Fig. 2). This tool was developed for the purpose of ascertaining cTakes smoking pipeline reliability with respect to classifying documents into the 4 tobacco-use status classes. The data abstraction tool read each clinical note and highlighted the keywords relevant to tobacco use. An expandable keywords-list was dynamically created which was continually updated by the coders based on recurrence of these words within clinical documents and facilitate their identification during document scans. The main body of the interface contained the clinical note. The top right corner of the interface provided with the document count and navigation buttons were used to scan the document as follows. The ‘Previous Doc (document)’ and ‘Next Doc’ helped navigate to the previous and next documents, respectively. In addition, the ‘Go to Doc’ button could be used to navigate directly to a specific clinical note provided that the document count was known. All terms highlighted in blue were present in the list of keywords (‘tobacco’, ‘smoker’, ‘quit’ etc.). The keywords were added to the software by clicking on the ‘Dictionary’ button. Creation of the glossary expedited identification of relevant terms in the document. Once the annotation was completed, the “Review Done” checkbox was selected by the annotator. The ‘Assessment’ section at the bottom of the interface captured the cTakes’ evaluation result. The output of the cTakes result was displayed on the right side (NLP Results) in green whereas the left side (Smoking) documented manual annotation. If the result from cTakes was in agreement to the coders’ classification of the document, the checkbox, “I agree with the NLP results for this document” was checked by the annotator. If consensus for classification was not achieved, the annotator selected the correct class.

2.1.1.3. Statistical analyses

The two coders each initially classified 1,000 same documents manually to identify any disparities in their logic and to develop a consensus process. The interrater agreement was determined by calculating the Cohen’s kappa derived from SAS proc FREQ with AGREE option, for the classification of these 1,000 documents. The two coders later classified 4,000 documents (2,000 each) based on the rules established during the consensus process. Finally the two coders independently classified the remaining 371 documents. The Cohen’s kappa calculations were compared with the inclusion of a 95% confidence interval which was constructed by applying the exact Binomial Distribution Approach.

Consensus: When inconsistencies in classification of a document between the two coders occurred, another member of the research team (AA) mediated establishment of consensus on for definition of the final classification. This set the classification rules and mitigated any ambiguities, establishing the gold standard for future classification by the coders. The final algorithm was evaluated for precision, recall, specificity, F-measure and accuracy for each tobacco use classification based on the outcomes of manual validation.

Precision was defined as the proportion of the predicted ‘true positive’ cases actually being real positive cases and was calculated by the following formula:

$\displaystyle\textit{Precision}=\frac{\textit{true positive}}{(\textit{true % positive}+\textit{false positive})}$ (1)

Recall was defined as the proportion of correctly predicted ‘true positives’ and was calculated by the following formula:

$\displaystyle\textit{Recall}=\frac{\textit{true positive}}{(\textit{true % positive}+\textit{false negative})}$ (2)

Specificity was defined as the proportion of correctly predicted ‘true negatives’ and was calculated by the following formula:

$\displaystyle\textit{Specificity}=\frac{\textit{true negative}}{(\textit{true % negative}+\textit{false positive})}$ (3)

F-measure is defined as the harmonic mean of precision and recall and was calculated by the following formula:

$\displaystyle\textit{F-measure}=2*\frac{\textit{precision}*\textit{recall}}{(% \textit{precision}+\textit{recall})}$ (4)

Accuracy was calculated by the following formula:

$\displaystyle\textit{Accuracy}=\frac{(\textit{true positive}+\textit{true % negative})}{(\textit{true positive}+\textit{true negative}+\textit{false % positive}+\textit{false negative})}$ (5)

A ‘True positive’ for a given class corresponded to accurate classification of the document into a given class. Based on this definition, a ‘false positive’ for a class corresponded with incorrect classification of a document into that class. Similarly ‘true negative’ for a class corresponded with a document being correctly classified as not corresponding with a class other than the one for which it was a ‘true positive’, and ‘false negative’ for a class corresponded to a document being incorrectly classified as negative in an index class. A low value for ‘true positive’ and/or high value for ‘false positive’ was associated with low precision.

2.1.2 Phase II: Patient-level classification

All available corpora-based data for each patient were analyzed by a rule-based classifier developed as part of this study, which determined the present tobacco-use status of the patient based on the timestamps of the corpus generation.

Rule-based algorithm: The program first arranged all documents by patient and then chronologically with the latest document on top followed by the earlier ones. A 30-day tobacco-use abstinence period was set to distinguish current and past smoker status. If the latest document was classified as ‘Current Smoker’, the patient was classified as ‘Current Smoker’ and preceding history had no impact on the final classification. If the latest document was classified as ‘Past Smoker’, the previous document classifications were examined. If any of the previous documents were classified as ‘Current Smoker’ and was created within 30 days of the latest document, the patient was automatically re-classified as a ‘Current Smoker’. If a patient had just two documents created within 30 days with the latest one classified as ‘Nonsmoker’ and the other one as ‘Current Smoker’, the patient would be re-classified as a ‘Current Smoker’. Using the same example, if the timespan between the dates was $>$ 30 days, the patient was classified as a ‘Past Smoker’. Figure 3 shows the rule-based classification of patients that addresses all the possible combinations of document classifications which would ultimately help categorize the index patient associated with them. If the latest document associated with a patient was classified as ‘Unknown’, all the past documents were scanned to find one of the other three classifications and all the above mentioned rules were applied until all documents for that patient was scanned. Thus, the rule-based algorithm consolidated the collective tobacco use to determine the present status of the patient’s tobacco use. Patient-level data consolidation by the rule-based algorithm was applied to both manual document classification as well as classification by cTakes to derive a gold-standard manual patient-level classification and cTakes-derived patient-level classification, respectively.

2.1.2.1. Validation

The same two coders manually classified tobacco-use status of 40 randomly-selected patients based on the gold standard document classification as well as the document-level classification derived from cTakes. This process tested and validated the rule-based algorithm developed during the post-processing (Fig. 3). Since the rules for this algorithm were strictly set, there could be no ambiguities between the annotator decisions and hence an inter-rater agreement was not determined at this level.

Figure 3.

The classifications for each clinical document were consolidated to form a patient-level classification based on a rule-based algorithm developed by the team.

3. Results

Of the 400 patients randomly selected for the study analysis, 37 patients did not have any clinical notes documented and hence were excluded from the study. Out of these, 363 patients yielded a corpus of 5,371 free-text clinical documents with an average of 15 documents per patient.

3.1 Phase I

Cohen’s Kappa for first 1000 common manually validated documents was 0.9448 (95% Confidence Interval (CI) $=$ 0.928–0.962), while the Cohen’s Kappa for the 371 documents after consensus was 0.9889 (95% CI $=$ 0.977–1.000).

The gold standard developed by manually classified documents categorized 758 (14.12%), 1401 (26.08%), 1795 (33.42%) and 1417 (26.38%) documents to ‘Current Smoker’, ‘Past Smoker’, ‘Nonsmoker’ and ‘Unknown’ respectively. Consensus for classification by cTakes with manual classification achieved correct classification for each category as follows: 56.51% (694/1228), 92.14% (891/967), 89.47% (1597/1786) and 99.78% (1387/1390) with F-measures of 0.700, 0.753, 0.893 and 0.988 respectively. The overall accuracy of document level classification by the cTakes smoking pipeline was 85.1% (95% CI $=$ 0.841–0.860).

3.2 Phase II

Patient-level consolidation of the documents using the rule-based algorithm, yielded 51 (14.05%) patients tagged as ‘Current Smoker’, 191 (52.62%) tagged as ‘Past Smoker’, 113 (31.13%) tagged as ‘Nonsmoker’ and 8 (2.2%) patients with ‘Unknown’ information. Consensus for classification by cTakes with manual gold standard patient classification achieved correct classification for each category as follows: Current Smoker: 96.08%; Past Smoker 71.21%; Nonsmoker: 61.07% and Unknown: 87.50% with F-measures of 0.580, 0.771, 0.730 and 0.933, respectively. The overall accuracy of the patient level classification was 71.9% (95% CI $=$ 0.670–0.765).

Manual validation of 40 randomly-selected patients undertaken by the two coders matched the results of the rule-based algorithm classification by 100%. Of the 40 patients, one patient belonged to the ‘Current Smoker’ class, $n=$ 21 to ‘Past Smoker’, $n=$ 17 to ‘Nonsmoker’ and one to the ‘Unknown’ class.

4. Implications

We analyzed over 5000 documents to extract tobacco-use related information from the clinical notes of patients. When a corpus contained the phrase “Tobacco Use: None”, cTakes would perceive it as 2 sentences “Tobacco Use” and “None” thus classifying the corpus as ‘current smoker’ instead of ‘nonsmoker’. In order to correct this artifact, we pre-processed the corpora by replacing ‘: (colon)’ with a space because cTakes considered ‘:’ as a sentence separator.

Table 1
Predicted classification for Phase I (cTakes) in form of confusion matrix

Percentage (%)	Current smoker	Past smoker	Nonsmoker	Unknown
Gold standard
Current smoker	91.56	6.07	2.12	0.27
Past smoker	24.56	63.6	11.78	0.08
Nonsmoker	9.53	1.45	89.03	0
Unknown	1.35	0.29	0.5	97.89
Metrics
Precision	56.60	92.20	89.50	99.80
Recall	91.60	63.60	89.10	97.90
Specificity	87.90	98.00	94.10	100.00
F-measure	70.00	75.30	89.30	98.80

Table 1 shows the predicted classification of Phase I in form of confusion matrix. Notably, we observed that the F-measure for the ‘current smoker’ class was lower than for other classes. The low F-measure for the ‘current smoker’ class is likely attributable to its low precision. It was observed during manual validation that the context of language related to smoking introduced challenges for the cTakes software in classifying a document into a ‘current smoker’. For example, one of the clinical notes documented patient exposure to second-hand smoke from a roommate, but no conclusive evidence ascertained that the patient was ever an active smoker. Another instance associated with cTakes misclassification occurred when patients were asked whether they had smoke-detectors in their residence. The presence of ‘smoke’ in the text led to a misclassification of the patient as a ‘current smoker’. A potential solution to improve precision is found in a study by Kang et al., who developed an NLP module (consisting of submodules) to improve precision by targeting reduction of false positives through compilation of terms identified through error analysis that were associated with false positive classification [31]. The study used ‘Filtering’ submodule which consisted of two rules. One of the rules removed a tagged concept (for instance: ‘smoke-detector’ as the example from our study) and ultimately improved precision by 18.7% [31]. In the interest of developing a clinically applicable tool, improvement of precision through this approach is in progress.

Table 2 shows the predicted classification of Phase II in form of confusion matrix. The second lowest F-measure was recorded for the ‘past smoker’ class and was likely attributable to the low recall for the same class. For example, one of the patients misclassified by cTakes as a ‘past smoker’ contained text indicating that the patient had a family history of smoking. This would not classify the patient as a smoker, but presence of terminology alluding to ‘smoking’ and ‘history’ resulted in the ‘past smoker’ classification error. Although cTakes generally classified documents into the 4 classes with high accuracy, certain limitations as provided in the examples above were noted, identifying that contextual framework of statements was critical to accurate classification.

Table 2

Predicted classification for Phase II (cTakes) in form of confusion matrix

Percentage (%)	Current smoker	Past smoker	Nonsmoker	Unknown
Gold standard
Current smoker	96.08	1.97	1.97	0
Past smoker	25.66	71.21	3.15	0
Nonsmoker	16.82	22.13	61.07	0
Unknown	12.5	0	0	87.5
Metrics
Precision	41.50	84.00	90.80	100.00
Recall	96.10	71.20	61.10	87.50
Specificity	75.40	82.80	96.50	100.00
F-measure	58.00	77.10	73.00	93.30

The smoking pipeline in cTakes has previously been used by other studies. In one of the studies the precision, recall and F-measure, respectively for the classes were reported as: current smoker (0.3, 0.81, 0.43); past smoker (0.87, 0.6, 0.71) with a reported overall accuracy of 69% [4]. Another study [5] reported the precision, recall and F-measure, respectively for the classes as: current smoker (0.48, 0.9, 0.63); past smoker (0.9, 0.54, 0.67); nonsmoker (0.93, 0.85. 0.88). Later study [5] did not report accuracy, and data regarding rates of true positive and true negatives were not published so that extrapolation of accuracy was not possible for comparison purposes. Based on the document level results reported, our approach demonstrated improvement in overall accuracy in comparison to those reported in previous studies. The improved outcomes may also be partially attributable to our use of a later version of cTakes (version 3.1.1) compared to outcomes achieved in the other studies in which reported application of earlier versions 2014 (version 2.5) and 2012 (version 1.1.0), respectively. However, irrespective of the cTakes version used, the pattern of the ‘Current Smoker’ class having a low precision compared to the other classes and the ‘Past Smoker’ class having a lower recall than the other classes persisted across all of the studies mentioned above.

The main limitation we noticed in the cTakes smoking pipeline was the ‘smoking’ terminology. The system worked at identifying tobacco use in documents irrespective of route of tobacco exposure (i.e., smoking or chewing.) For example, if a patient had been documented with chewing tobacco exposures, cTakes would automatically classify that document (and ultimately that patient) as a ‘Current Smoker’ applying incorrect terminology. A main study limitation was that the data we analyzed was captured via a single institutional EHR. Additional approaches will need to be defined to distinguish tobacco smokers from tobacco chewers. Another important aspect of smoking history not considered in the current approach to smoking status classification centers around availability of quantitative information documented in the EHR with respect to tobacco use/exposure. For example, ‘cigarette-months’ or ‘cigarette-years’ were captured in some of the patient records. This rule based algorithm could widely assist healthcare providers in determining the present patient’s tobacco use status and in the context of historical tobacco use behavior, provide smoking cessation advice. In the future, this approach may also be extended to symptomology classification associated with other diseases and other classification studies.

5. Conclusion

The findings of the study conclude that Natural Language Processing and a rule-based algorithm can be employed to determine the latest tobacco consumption status of patients from their clinical-notes. This provides a broad overview of the tobacco use behavior of patients, taking their tobacco use history into consideration instead of the ‘snapshot’ of their status on the day of the appointment. This alleviates the time-consuming process of going through the clinical notes in the patient’s history to manually deduce the latest tobacco consumption status. For example, if the individual was identified as a smoker, the history would provide the clinician information regarding whether the individual was a new smoker, relapsed smoker or had an established smoking history. The present status of tobacco use would help the researchers and healthcare providers to identify the patient cohort through data mining. The study findings will be applied to mine input variables for the OCRARI project. Furthermore, future work would involve more detailed analyses about the tobacco consumption for e.g. exposure type (cigarettes, cigars, pipes, snuff, e-cigarette, etc.), duration and frequency of the indulgence (e.g. smoked for 5 years, twice daily etc.).

Footnotes

Acknowledgments

The authors would like to thank Marie Fleisner from the Office of Scientific Writing at Marshfield Clinic Research Institute for her assistance with reviewing and editing this manuscript. The authors would like to thank Ms. Barbara Bartkowiak, Marshfield Clinic reference librarian, who helped in literature search. The authors would also like to thank Mr. Eric LaRose from Office of Research Computing and Analytics for his contribution towards modifying the abstraction software for the purpose of this study. The authors would like to thank Aloksagar Panny from Center for Oral and Systemic Health for his assistance with formatting of this manuscript. This work was supported, in part, by a grant from Delta Dental of Wisconsin, funds from Marshfield Clinic Research Institute and Family Health Center of Marshfield, Inc., and by the Clinical and Translational Science Award (CTSA) program, through the NIH National Center for Advancing Translational Sciences (NCATS), grant UL1TR000427. Its content is solely the responsibility of the authors and do not necessarily represent the official view of the National Institute of Health (NIH) or Delta Dental of Wisconsin.

Conflict of interest

The authors declare no real or potential conflicts of interest.

References

U.S. Department of Health and Human Services. U.S. Department of Health and Human Services. The Health Consequences of Smoking – 50 Years of Progress: A Report of the Surgeon General: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center. Rockville, MD: 2014.

World Health Organization. WHO Report On The global tobacco epidemic, 2015 Raising taxes on tobacco. WHO Press 2015. www.who.int/tobacco (accessed November 2, 2017).

Warnakulasuriya

Sutherland

Scully

. Tobacco, oral cancer, and treatment of dependence. Oral Oncol 2005; 41: 244–60. doi: 10.1016/j.oraloncology.2004.08.010.

Khor

Yip

W-K

Bressel

Rose

Duchesne

Foroudi

. Practical implementation of an existing smoking detection pipeline and reduced support vector machine training corpus requirements. J Am Med Inform Assoc 2014; 21: 27–30. doi: 10.1136/amiajnl-2013-002090.

Liu

Shah

Jiang

Peterson

Dai

Aldrich

, et al. A study of transportability of an existing smoking status detection module across institutions. AMIA Annu Symp Proc 2012; 2012: 577–86.

Hecht

. Tobacco Smoke Carcinogens and Lung Cancer. JNCI J Natl Cancer Inst 1999; 91: 1194–210. doi: 10.1093/jnci/91.14.1194.

White

D’Aloisio

Nichols

DeRoo

Sandler

. Active Tobacco Smoke and Environmental Tobacco Smoke Exposure During Potential Biological Windows of Susceptibility in Relation to Breast Cancer. Cancer Epidemiol Biomarkers Prev 2016; 25: 562–562. doi: 10.1158/1055-9965.EPI-16-0091.

Jamal

Dube

Malarcher

Shaw

Engstrom

, Centers for Disease Control and Prevention (CDC). Tobacco use screening and counseling during physician office visits among adults – National Ambulatory Medical Care Survey and National Health Interview Survey, United States, 2005–2009. MMWR Suppl 2012; 61: 38–45.

Siu

. Behavioral and Pharmacotherapy Interventions for Tobacco Smoking Cessation in Adults, Including Pregnant Women: U.S. Preventive Services Task Force Recommendation Statement. Ann Intern Med 2015; 163: 622. doi: 10.7326/M15-2023.

10.

Shimpi

Bharatkumar

Jethwani

Chyou

P-H

Glurich

Blamer

, et al. Knowledgeability, Attitude and Behavior of Primary Care Providers Towards Oral Cancer: a Pilot Study. J Cancer Educ 2016: 1–6. doi: 10.1007/s13187-016-1084-4.

11.

Clark

Good

Jezierny

Macpherson

Wilson

Chajewska

, et al. Identifying smokers with a medical extraction system. J Am Med Inform Assoc 2008; 15: 36–9. doi: 10.1197/jamia.M2442.

12.

Hazlehurst

Sittig

Stevens

Smith

Hollis

Vogt

, et al. Natural language processing in the electronic medical record: assessing clinician adherence to tobacco treatment guidelines. Am J Prev Med 2005; 29: 434–9. doi: 10.1016/j.amepre.2005.08.007.

13.

Centers for Disease Control and Prevention (CDC). Centers for Disease Control and Prevention. Meaningful use 2017. http://www.cdc.gov/ehrmeaningfuluse/introduction.html (accessed November 22, 2016).

14.

Rostami

Hegde

Shimpi

Pack

Olson

Acharya

. Oral Cancer Risk Assessment Using Machine Learning Algorithms. 2016 AADR/CADR Annu. Meet. (March 16–19, 2016) (Los Angeles), Los Angeles: Journal of Dental Research 95(A): 1486; 2016.

15.

Hyun

Johnson

Bakken

. Exploring the ability of natural language processing to extract data from nursing narratives. Comput Inform Nurs 2009; 27: 215-23-5. doi: 10.1097/NCN.0b013e3181a91b58.

16.

Valdez

Rueschman

Kim

Redline

Sahoo

. An Ontology-Enabled Natural Language Processing Pipeline for Provenance Metadata Extraction from Biomedical Text (Short Paper). move to meaningful Internet Syst…: CoopIS, DOA, ODBASE Confed. Int. Conf. CoopIS, DOA, ODBASE… proceedings. OTM Confed. Int. Conf., vol. 10033, 2016, p. 699–708. doi: 10.1007/978-3-319-48472-3_43.

17.

Dorr

Gaasterland

. Exploiting aspectual features and connecting words for summarization-inspired temporal-relation extraction. Inf Process Manag 2007; 43: 1681–704. doi: 10.1016/j.ipm.2007.01.008.

18.

Hegde

Shimpi

Pack

Rostami

Acharya

. Smoking Status Classification Of Clinical Notes Using Natural Language Processing. 2016 AADR/CADR Annu. Meet. (March 16–19, 2016) (Los Angeles), Los Angeles: Journal of Dental Research 95(A): 0997; 2016.

19.

Zeng

Goryachev

Weiss

Sordo

Murphy

Lazarus

. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak 2006; 6: 30. doi: 10.1186/1472-6947-6-30.

20.

Wang

Akella

. A Hybrid Approach to Extracting Disorder Mentions from Clinical Notes. AMIA Jt Summits Transl Sci Proc AMIA Summit Transl Sci 2015; 2015: 183–7.

21.

Gould

Sakoda

Ritzwoller

Simoff

Neslund-Dudas

Kushi

, et al. Monitoring Lung Cancer Screening Utilization and Outcomes in Four Cancer Research Network Sites. Ann Am Thorac Soc 2017: AnnalsATS.201703-237OC. doi: 10.1513/AnnalsATS.201703-237OC.

22.

Regan

Meigs

Grinspoon

Triant

. Determinants of Smoking and Quitting in HIV-Infected Individuals. PLoS One 2016; 11: e0153103. doi: 10.1371/journal.pone.0153103.

23.

Cavazza

Zweigenbaum

. Extracting implicit information from free text technical reports. Inf Process Manag 1992; 28: 609–18. doi: 10.1016/0306-4573(92)90030-4.

24.

Park

Hartzler

Huh

McDonald

Pratt

. Automatically Detecting Failures in Natural Language Processing Tools for Online Community Text. J Med Internet Res 2015; 17: e212. doi: 10.2196/jmir.4612.

25.

Varpio

Rashotte

Day

King

Kuziemsky

Parush

. The EHR and Building the Patient’s Story: A Qualitative Investigation of How EHR Use Obstructs a Vital Clinical Activity. Int J Med Inform 2015; 84: 1019–28. doi: 10.1016/j.ijmedinf.2015.09.004.

26.

Nikfarjam

Emadzadeh

Gonzalez

. Towards generating a patient’s timeline: Extracting temporal relationships from clinical notes. J Biomed Inform 2013; 46: S40–7. doi: 10.1016/j.jbi.2013.11.001.

27.

Wang

Ruan

Yang

Liu

. Comparison of Three Information Sources for Smoking Information in Electronic Health Records. Cancer Inform 2016; 15: 237–42. doi: 10.4137/CIN.S40604.

28.

Savova

Masanz

Ogren

Zheng

Sohn

Kipper-Schuler

, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Informatics Assoc 2010; 17: 507–13. doi: 10.1136/jamia.2009.001560.

29.

Husten

. How should we define light or intermittent smoking? Does it matter? Nicotine Tob Res 2009; 11: 111–21. doi: 10.1093/ntr/ntp010.

30.

Hassmiller

Warner

Mendez

Levy

Romano

. Nondaily Smokers: Who Are They? Am J Public Health 2003; 93: 1321–7. doi: 10.2105/AJPH.93.8.1321.

31.

Kang

Singh

Afzal

van Mulligen

Kors

. Using rule-based natural language processing to improve disease normalization in biomedical text. J Am Med Inform Assoc 2013; 20: 876–81. doi: 10.1136/amiajnl-2012-001173.

Tobacco use status from clinical notes using Natural Language Processing and rule based algorithm

Abstract

BACKGROUND:

OBJECITVE:

METHODS:

RESULTS:

CONCLUSIONS:

Keywords

1. Introduction

1.1 Related work

1.2 Research objective

2. Materials and methods

2.1 Data retrieval

2.1.1 Phase I: Corpus-level classification

2.1.1.1. Post-processing

2.1.1.2. Validation

2.1.1.3. Statistical analyses

2.1.2.1. Validation

3.1 Phase I

3.2 Phase II

4. Implications

Table 1 Predicted classification for Phase I (cTakes) in form of confusion matrix

Footnotes

Acknowledgments

Conflict of interest

References

Table 1
Predicted classification for Phase I (cTakes) in form of confusion matrix