Finding Text-Supported Gene-to-Disease Co-appearances with MOPED-Digger

Abstract

Gene/disease associations are a critical part of exploring disease causes and, ultimately, cures, yet the publications that might provide such information are too numerous to be manually reviewed. We present a software utility, MOPED-Digger, that enables focused human assessment of literature by applying natural language processing (NLP) to search for customized lists of genes and diseases in titles and abstracts from biomedical publications. The results are ranked lists of gene/disease co-appearances and the publications that support them. Analysis of 18,159,237 PubMed titles/abstracts yielded 1,796,799 gene/disease co-appearances that can be used to focus attention on the most promising publications for a possible gene/disease association. An integrated score is also provided so that broadly presented published evidence can be assessed and more tenuous connections captured. MOPED-Digger is written in Java and uses the Apache Lucene 5.0 library. The utility runs as a command-line program with a variety of user options and is freely available for download from the MOPED 3.0 website (moped.proteinspire.org).

Introduction

Much of the knowledge about the relationships between genes and diseases is locked away in a massive and continuously expanding body of journal publications. Extracting the relationships between human diseases and genes from publications is particularly challenging because multiple synonyms and alternate terms for both genes and diseases can obscure potentially life-saving discoveries. To help, we present a software utility, “MOPED-Digger,” which performs a large-scale dictionary-based gene/disease term search using the Apache Lucene 5.0 library (Apache Lucene, http://lucene.apache.org), with results mapped to key terms. By applying natural language processing (NLP), MOPED-Digger enables focused human assessment of biomedical literature by building ranked lists of gene/disease co-appearances and the publications that support them.

Methods

MOPED-Digger uses two customizable lists (genes and diseases) to search an index of documents for matches. Each gene and disease on the lists has a key symbol and synonyms. To build a searchable index, MOPED-Digger reads the title and abstract from a PubMed file into memory, breaks the combined title/abstract content into searchable text elements (see Supplementary Information; supplementary material is available online at www.liebertpub.com/omi), and adds the result to the index. MOPED-Digger's current gene list contains the key symbols and synonyms of approximately 21,000 protein-coding genes, drawn from the UniProt human reference proteome (The UniProt Consortium, 2015).
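The list-and-index idea can be illustrated with a minimal Python sketch. It is not MOPED-Digger's Java/Lucene implementation: the two-gene list, the PubMed IDs, the regular-expression tokenizer (a stand-in for a Lucene analyzer), and all function names are assumptions made for illustration, and only single-token terms are looked up.

```python
import re
from collections import defaultdict

# Hypothetical two-entry gene list: key symbol -> synonyms (illustrative only).
GENE_LIST = {
    "TP53": ["p53", "tumor protein p53"],
    "BRCA1": ["breast cancer 1", "RNF53"],
}

def build_term_lookup(term_list):
    """Map every lower-cased key symbol and synonym back to its key symbol."""
    lookup = {}
    for key, synonyms in term_list.items():
        for term in [key, *synonyms]:
            lookup[term.lower()] = key
    return lookup

def tokenize(text):
    """Split text into lower-cased alphanumeric tokens (assumed stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    """Build an inverted index: token -> set of PubMed IDs whose title/abstract contain it."""
    index = defaultdict(set)
    for pmid, (title, abstract) in records.items():
        for token in tokenize(title + " " + abstract):
            index[token].add(pmid)
    return index

def find_single_token_terms(index, lookup):
    """Return PubMed ID -> set of key symbols, considering single-token terms only."""
    hits = defaultdict(set)
    for term, key in lookup.items():
        tokens = tokenize(term)
        if len(tokens) == 1:
            for pmid in index.get(tokens[0], ()):
                hits[pmid].add(key)
    return dict(hits)

# Hypothetical records: PubMed ID -> (title, abstract).
records = {
    "1001": ("p53 mutations in colon cancer", "We studied TP53 in tumors."),
    "1002": ("RNF53 and hereditary risk", "No other gene is discussed."),
}
index = build_index(records)
gene_hits = find_single_token_terms(index, build_term_lookup(GENE_LIST))
print(gene_hits)
```

In this toy run the synonym "RNF53" is reported under its key symbol BRCA1, which is the mapping-to-key-term behavior described above; multi-word terms would need a phrase or proximity match rather than a single-token lookup.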

The current gene list can be integrated with the multi-omics expression data in MOPED (Multi-Omics Profiling Expression Database; Higdon et al., 2012; Higdon et al., 2013; Kolker et al., 2012; Montague et al., 2014; 2015). After a gene key symbol or synonym is found in the indexed text, a Gene Match Score is calculated to provide a match confidence level: high, medium, or low (see Supplementary Information).

The disease list was obtained from the Disease Ontology (DO) database and currently contains approximately 9,000 disease names and synonyms (Kibbe et al., 2015). It is also customizable: a single disease term or a focused set of terms can be used instead. MOPED-Digger searches for both key diseases (primary search terms) and synonyms using a Lucene proximity search. Synonyms are mapped back to the key disease (see Supplementary Information).
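The effect of a proximity value can be sketched in Python as follows. This is a simplified approximation, not Lucene's slop semantics: here a term matches when all of its words fall, in any order, inside one window of (number of words + proximity) tokens. The one-entry disease list, the tokenizer, and the function names are assumptions for illustration.

```python
import re

# Hypothetical one-entry disease list: key disease -> synonyms (illustrative only).
DISEASE_LIST = {
    "colorectal cancer": ["colon cancer"],
}

def tokenize(text):
    """Split text into lower-cased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_within(tokens, phrase_tokens, proximity):
    """True if every phrase token occurs inside one window of len(phrase) + proximity tokens."""
    window = len(phrase_tokens) + proximity
    needed = set(phrase_tokens)
    for start in range(len(tokens)):
        if needed.issubset(tokens[start:start + window]):
            return True
    return False

def find_key_diseases(text, disease_list, proximity=3):
    """Return the key diseases whose name or any synonym matches within the proximity window."""
    tokens = tokenize(text)
    found = set()
    for key, synonyms in disease_list.items():
        for term in [key, *synonyms]:
            if matches_within(tokens, tokenize(term), proximity):
                found.add(key)  # synonyms are reported under the key disease
                break
    return found

print(find_key_diseases("Cancer of the sigmoid colon", DISEASE_LIST, proximity=1))
print(find_key_diseases("Cancer of the sigmoid colon", DISEASE_LIST, proximity=3))
print(find_key_diseases("Skin cancer but a healthy colon", DISEASE_LIST, proximity=3))
```

Under these assumptions the first text is missed at proximity 1 and found at proximity 3, while the third text is a spurious hit at proximity 3. A looser window recovers more true mentions at the cost of more false ones.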

Searches of PubMed articles result in two mappings: PubMed ID → key gene symbol and PubMed ID → key DO disease. After the two mappings are generated, the PubMed ID is used to join them and build a final results file that includes the gene/disease terms, an integrated score, scores from two external websites that analyze gene and disease terms (Piñero et al., 2015; Pletscher-Frankild et al., 2015), and the list of article identifiers in which the gene/disease co-appearance was found. By utilizing an NLP approach, MOPED-Digger determines meaningful matches in free-text publications despite the challenge of multiple synonyms and alternate terms (see Fig. 1).
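The join on PubMed ID can be sketched in Python as below. The input mappings, PubMed IDs, and function names are hypothetical, and ranking by the number of supporting articles is only an assumed stand-in for the integrated score, whose definition is given in the Supplementary Information rather than here.

```python
from collections import defaultdict
from itertools import product

def build_coappearances(pmid_to_genes, pmid_to_diseases):
    """Return {(gene, disease): list of PubMed IDs in which both were found}."""
    support = defaultdict(list)
    # Only articles present in both mappings can support a co-appearance.
    for pmid in sorted(pmid_to_genes.keys() & pmid_to_diseases.keys()):
        genes = sorted(pmid_to_genes[pmid])
        diseases = sorted(pmid_to_diseases[pmid])
        for pair in product(genes, diseases):
            support[pair].append(pmid)
    return dict(support)

def rank_by_support(coappearances):
    """Order pairs by number of supporting articles, most first (ties alphabetical)."""
    return sorted(coappearances.items(), key=lambda item: (-len(item[1]), item[0]))

# Hypothetical search results keyed by PubMed ID.
pmid_to_genes = {"1001": {"TP53"}, "1002": {"BRCA1", "TP53"}, "1003": {"BRCA1"}}
pmid_to_diseases = {
    "1001": {"colorectal cancer"},
    "1002": {"colorectal cancer"},
    "1004": {"asthma"},
}

ranked = rank_by_support(build_coappearances(pmid_to_genes, pmid_to_diseases))
for (gene, disease), pmids in ranked:
    print(gene, disease, len(pmids), ",".join(pmids))
```

Articles that appear in only one of the two mappings (here "1003" and "1004") contribute nothing, so each reported pair always carries the list of articles that support it.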

FIG. 1. MOPED-Digger's process for finding co-appearances of genes and diseases in PubMed abstracts.

Results and Discussion

The NCBI disease corpus is a “Gold Standard” set of manually curated publications with disease terms identified that can be used by the biomedical natural language processing community (Doğan et al., 2014). Gene and disease terms were identified by MOPED-Digger in the NCBI disease corpus abstracts that were downloaded on 04/25/2015. The NCBI-identified disease terms within the abstracts were manually mapped to the disease terms in MOPED-Digger's DO disease list. Given that both MOPED-Digger and the manual curation mapped all terms found to the same key disease terms (DO disease names), the comparison of results was straightforward.

Of the 720 possible gene matches in the gold standard set, using the high-confidence matches (minimum score of 10), 429 were found, 291 were missed, and 70 false positives were “found”, for 86% precision and 60% recall (see Table 1 and Supplementary Information). Of the 1415 possible disease matches, 881 were found, 534 were missed, and 217 false positives were “found”, for 80% precision and 62% recall with a proximity search value of 3 (medium proximity value; see Table 2 and Supplementary Information). MOPED-Digger was then applied to the MEDLINE/PubMed Baseline Database (PubMed), which holds over 23 million records. In total, 18,159,237 abstracts were analyzed (publishing years 1980–2015), including all articles available on 02/25/2015, and 1,796,799 unique gene-disease co-appearances were found (see Supplementary Information).

Table 1. MOPED-Digger Gene Search Performance on NCBI Disease Corpus Abstracts Based on Score Threshold

Minimum score	True positive (TP)	False positive (FP)	False negative (FN)	Precision (PPV) %	Recall (Sensitivity) %	F-Score %
1	598	272	122	68.74	83.06	75.22
5	491	148	229	76.84	68.19	72.26
10	429	70	291	85.97	59.58	70.39

The above metrics are calculated as follows: Precision = TP/(TP + FP), Recall = TP/(TP + FN), F-score = 2 × Precision × Recall/(Precision + Recall).
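As a worked check of these formulas, the short Python sketch below recomputes the three metrics from the TP, FP, and FN counts in Table 1. The function name and the rounding to two decimals are choices made for this illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) as percentages rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 2) for value in (precision, recall, f_score))

# (minimum score, TP, FP, FN) rows copied from Table 1.
table_1 = [(1, 598, 272, 122), (5, 491, 148, 229), (10, 429, 70, 291)]

for min_score, tp, fp, fn in table_1:
    print(min_score, precision_recall_f(tp, fp, fn))
```

The recomputed values agree with the percentages listed in Table 1, for example 85.97, 59.58, and 70.39 at a minimum score of 10.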

Table 2. MOPED-Digger Disease Search Performance on NCBI Disease Corpus Abstracts Based on Proximity Value Specified in Search

Proximity	True Positive (TP)	False Positive (FP)	False Negative (FN)	Precision (PPV) %	Recall (Sensitivity) %	F-Score %
1	864	175	551	83.16	61.06	70.42
3	881	217	534	80.24	62.26	70.12
5	884	264	531	77.00	62.47	68.98

A higher proximity value brings in more true positives (TP) but it also leads to more false positives (FP).

The customization options of MOPED-Digger allow the tool to be used in many different ways, for example to determine genes associated with multiple diseases, co-morbidities, and symptoms. We used MOPED-Digger with custom lists of disease search terms for Autism Spectrum Disorder and for the gastrointestinal symptoms related to it. By integrating the gene search results and these two disease search results, we were able to identify genes that co-appeared in abstracts with both diseases' symptom terms (Higdon et al., 2015).

MOPED-Digger enables more effective use of the published collective efforts of the life sciences community, ultimately leading to more discoveries, including gene-to-disease associations. Researchers can run the program locally with customized gene and disease lists. The results allow the user to focus attention on the most promising publications for a possible gene/disease association, while also broadly presenting published evidence to capture more tenuous connections.

Acknowledgments

The authors sincerely thank Rob Arnold and Larissa Stanberry for technical support, Vural Ozdemir, Rachel Earl, Caitlin Hudac, Jennifer Gerdts, and Raphael Bernier for clinical expertise and insightful collaboration, and Maggie Lackey for critical reading.

Funding. This work was supported by Seattle Children's Research Institute-Northeastern University to EK.

Author Disclosure Statement

No competing financial interests exist.

References

Doğan, Leaman, and Lu. (2014). NCBI disease corpus: A resource for disease name recognition and concept normalization. J Biomed Inform, 47, 1–10.

Higdon, Haynes, Stanberry, et al. (2012). Unraveling the complexities of life sciences data. Big Data, 1, 42–50.

Higdon, Stewart, Stanberry, et al. (2013). MOPED enables discoveries through consistently processed proteomics data. J Proteome Res, 13, 107–113.

Higdon, Earl, Stanberry, et al. (2015). The promise of multi-omics and clinical data integration to identify and target personalized healthcare approaches in autism spectrum disorders. OMICS, 19, 197–208.

Kibbe, Arze, Felix, et al. (2015). Disease ontology 2015 update: An expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nuc Acids Res, 43, D1071–1078.

Kolker, Higdon, Haynes, et al. (2012). MOPED: Model Organism Protein Expression Database. Nuc Acids Res, 40, D1093–1099.

Montague, Stanberry, Higdon, et al. (2014). MOPED 2.5—An integrated multi-Omics resource: Multi-Omics Profiling Expression Database now includes transcriptomics data. OMICS, 18, 335–343.

Montague, Janko, Stanberry, et al. (2015). Beyond protein expression, MOPED goes multi-omics. Nuc Acids Res, 43, Database issue, D1145–D1151.

Piñero, Queralt-Rosinach, Bravo, et al. (2015). DisGeNET: A discovery platform for the dynamical exploration of human diseases and their genes. Database, 2015, 1–17. Website: disgenet.org. Last access: 4/25/15.

Pletscher-Frankild, Pallejà, Tsafou, et al. (2015). DISEASES: Text mining and data integration of disease-gene associations. Methods, 74, 83–89. Website: http://diseases.jensenlab.org/Search.

The UniProt Consortium. (2015). UniProt: A hub for protein information. Nuc Acids Res, 43, D204–D212. Website: uniprot.org. Last access: 4/25/15.
