Abstract
Many bibliographic databases describe the content of a publication using a thesaurus. The vocabularies vary and the extent to which the databases apply them may also differ significantly. The aim of this study is to empirically explore the number of subject headings assigned to publications in two databases over time and to determine if publication characteristics are associated with the number of subject headings. Articles and reviews in MEDLINE and Embase from 1990 to 2019 assigned one of the subject headings from six subject areas are included in this study. Each of the retrieved publications in Embase is matched with the same publication in MEDLINE. Furthermore, multivariable linear regressions are used to explore the association of the number of subject headings in MEDLINE and Embase with five prespecified publication characteristics. The average number of assigned subject headings in MEDLINE is stable or even slightly decreasing over time. In Embase, the average number of assigned subject headings was stable until about 2000, after which it increased dramatically over the next 3 years. Furthermore, linear regressions show that the average number of subject headings in MEDLINE and Embase is higher for publications in English, publications with longer abstracts, recent publications and publications in specific subject areas. However, reviews are assigned more subject headings in Embase and fewer in MEDLINE. The implications of the results are discussed.
1. Introduction
A thorough literature search requires using a combination of two retrieval approaches when searching bibliographic databases: one is based on natural language and the other is based on controlled vocabularies [1]. Searches based only on natural language (keywords) are generally considered insufficient for capturing all relevant publications and should be combined with entries from the controlled vocabulary [2–5]. Natural language is the language spoken and written by people, which in a bibliographic database occurs in titles, abstracts and author keywords [6]. The controlled vocabulary is an artificial language with limited and defined vocabulary, syntax, semantics and pragmatics [6]. A controlled vocabulary can be as simple as an authorised list of terms also known as an authority file, but it can also be a much more complex tool in the shape of a thesaurus [7]. A controlled vocabulary can be defined as a controlled list of terms that have been enumerated explicitly, each of which has an unambiguous, non-redundant definition [8].
The information in a thesaurus is intended for indexers and indexing systems, as well as for searchers and end-users [9]. The thesaurus aims to improve the recall and precision of the searches performed, and using the thesaurus may contribute unique hits [10,11]. The thesaurus allows the user to search for familiar terms if the user does not know the specialised terms [12]. The thesaurus thus helps users select the exact term they need and shows structural relationships between vocabulary terms [13]. The benefits of using controlled vocabularies over keyword searches have been examined intensively, and the reader is referred to a review for an overview of this research [14].
Vocabulary resources with synonyms, hierarchical structures and conceptual relationships provide powerful tools for retrieval across disparate data. However, their use is often a cause for concern in the search strategies of existing systematic reviews, and many of the most common errors involve the controlled vocabulary [15–18]. The problems of using controlled vocabularies can to some extent be explained by their weaknesses [19]. One such weakness is that if the controlled vocabulary is not invoked automatically, the searcher needs to learn the artificial language to fulfil the potential of the controlled vocabulary [19]. Furthermore, databases may have very different controlled vocabularies. A wide variety of vocabulary control tools such as thesauri exists [13], and the specific tool may affect retrieval. A previous version of The Cochrane Handbook argued that ‘Searches of EMBASE may, therefore, retrieve additional articles that were not retrieved by a MEDLINE search, even if the records were present in both databases’ [20]. Consequently, the controlled vocabulary is by no means a trivial part of planning systematic, thorough searches, and searchers should be aware of the differences between vocabularies. Examples include differences between Medical Subject Headings (MeSH, the controlled vocabulary used by MEDLINE) and CINAHL headings (the controlled vocabulary used by Cumulative Index to Nursing and Allied Health Literature, CINAHL) regarding qualitative literature [21,22] and differences between MeSH and Emtree (the controlled vocabulary used by Embase) [23,24]. A white paper by Elsevier on the differences between MeSH and Emtree argues that the number of preferred terms in Emtree is considerably larger than in MeSH [25].
The vocabularies may be different, but the extent to which the databases apply them may also differ significantly. This is due to differences in indexing policy. It has been reported that more subject headings are assigned to each paper in Embase than in MEDLINE, although there are only a few investigations of the supposedly higher average number of subject headings in Embase, and Elsevier, the company behind Embase, is often cited as the source [20,26,27]. The number of publications retrieved when performing a search using subject headings increases as the number of subject headings assigned to each publication increases [26,28,29]. However, some of the terms may be only marginally relevant for the specific publication [28]. There is no optimal number of subject headings [30,31], but when indexing publications, there needs to be a balance between recall and precision in a large database. A large number of subject headings increases retrieval, but will also increase the number needed to screen [32]. Consequently, searchers may experience unnecessarily high sensitivity and low precision in Embase searches. It has been explored whether higher precision can be achieved in Embase when using some or all major subject headings in the search, rather than all subject headings, including non-major. However, the focusing technique weakened the search strategies and cannot be recommended in general [28].
Furthermore, the subject headings are essential for retrieving documents with little other useful information available for the searcher (e.g. an article with an uninformative title). The length of the abstract is considered an important factor in information retrieval, and the longer the better [7]. If an abstract is very short or even lacking, the subject headings play an important role in retrieving the paper. However, indexers are influenced by, for example, the abstract [33], and the subject headings of a document are thus not independent of its abstract. The aim of this study is to empirically explore the number of subject headings assigned to papers in Embase and MEDLINE over time and to determine if publication characteristics are associated with the number of subject headings.
2. Subject headings in Embase and MEDLINE
The controlled thesauri used by Embase and MEDLINE cover similar fields and serve the same function, that is, to support the literature search process via a focused and formalised description of the content of the publications [9,19]. Embase subject headings are referred to as Emtree terms and MEDLINE subject headings as MeSH terms. However, the two thesauri differ from each other in several ways.
The specificity of subject headings varies: Emtree includes more than 80,000 terms [34], whereas MeSH contains 29,900 terms [35]. About half of the subject headings in Embase are drug and chemical terms, which underlines Embase’s focus on the pharmacological literature [34]. Emtree and MeSH are quite different when it comes to drug groups and terms. Serotonin reuptake inhibitors can serve as an example to illustrate the differences between the two thesauri. In Embase, in the hierarchy under the Emtree term Antidepressant Agent, you will find Serotonin Uptake Inhibitor, and below it a large number of systemic and drug names within this group of drugs. In MEDLINE, the hierarchy is different, as shown in the following.
2.1. Emtree
- Antidepressant Agent
○ Serotonin Uptake Inhibitor
▪ Citalopram
▪ Dapoxetine
▪ Escitalopram
▪ Fluoxetine
▪ Fluvoxamine
▪ Hyperforin
▪ Nomelidine
2.2. MeSH
- Antidepressive Agents
○ Antidepressive Agents, Second Generation
- Neurotransmitter Agents
○ Neurotransmitter Uptake Inhibitors
▪ Serotonin Uptake Inhibitors
- Serotonin Uptake Inhibitors (Pharmacological Action)
○ Citalopram
○ Fluoxetine
○ Fluvoxamine
- Organic Chemicals
○ Amines
▪ Citalopram
▪ Fluoxetine
- Organic Chemicals
○ Amines
▪ Hydroxylamines
• Oximes
○ Fluvoxamine
The differences in the structure of the hierarchy are important if you want to retrieve literature indexed with Serotonin Reuptake Inhibitors using the subject headings. In Embase, you can search for the Emtree term Serotonin Uptake Inhibitor and use the explode function which means that publications assigned with narrower subject headings are included in the search (e.g. articles indexed with Citalopram). If you want an equivalent search in MEDLINE, you cannot just search for the MeSH term Serotonin Uptake Inhibitors, as this subject heading does not include the underlying drugs (e.g. Citalopram and Fluoxetine). Instead, you must use the subject heading Serotonin Uptake Inhibitors (Pharmacological Action).
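The effect of the explode function on the two hierarchies can be illustrated with a minimal sketch. The dictionaries below are abbreviated toy fragments of the listings above, and `explode` is a hypothetical helper written for this illustration; it is not part of the Embase.com or MEDLINE search interfaces.

```python
# Toy fragments of the two hierarchies, abbreviated from the listings above.
# Real thesauri are far larger; this is not an Embase or MEDLINE API.
EMTREE = {
    "Antidepressant Agent": ["Serotonin Uptake Inhibitor"],
    "Serotonin Uptake Inhibitor": ["Citalopram", "Fluoxetine", "Fluvoxamine"],
}
MESH = {
    "Neurotransmitter Uptake Inhibitors": ["Serotonin Uptake Inhibitors"],
    "Serotonin Uptake Inhibitors (Pharmacological Action)": [
        "Citalopram",
        "Fluoxetine",
        "Fluvoxamine",
    ],
}


def explode(tree, term):
    """Return the term together with all of its narrower terms."""
    found = {term}
    for child in tree.get(term, []):
        found |= explode(tree, child)
    return found


emtree_hits = explode(EMTREE, "Serotonin Uptake Inhibitor")
mesh_plain = explode(MESH, "Serotonin Uptake Inhibitors")
mesh_action = explode(MESH, "Serotonin Uptake Inhibitors (Pharmacological Action)")
```

In this sketch, exploding the Emtree term reaches the individual drugs, exploding the plain MeSH term returns only the term itself, and the drugs are reached in MeSH only through the Pharmacological Action heading.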
3. Methods
To analyse the use of subject headings in Embase and MEDLINE, we need to compare a matched set of publications from the two databases. The databases overlap in coverage; however, both also index many unique publications [36–38]. Consequently, we cannot compare the publications in the two databases directly, and we therefore match publications from one database with the same publications indexed in the other database.
For the purpose of this study, we extract data from Embase through Embase.com and from MEDLINE through Web of Science. A sample of bibliographic records is exported as a CSV file from both databases for analyses. We have selected the following six subject areas to analyse: low back pain, psoriasis, midwife/midwifery, ophthalmology, urology and lithium. The subject areas have been selected to represent various subject areas within health sciences. All publications within these subject areas in both databases are retrieved using the subject headings covering these subject areas, and all narrower terms as well as subheadings are included in the search. We use the subject headings available in the thesauri at the time of data collection (November 2020). Searching MEDLINE data available in Web of Science provides different search results than searching PubMed. A recent study finds that the search results in MEDLINE varied considerably across five different vendor platforms, ‘indicating that there is no one MEDLINE’ [39, p. 17/21]. In this study, data are extracted to identify a set of overlapping publications in MEDLINE and Embase, not to explore coverage or search performance. Differences in results are therefore acceptable as we are only focusing on the overlap. The following subject headings are searched in this study (see Appendix 1 for the use of subject headings in the searches):
MEDLINE subject headings: psoriasis, low back pain, lithium, midwifery, urology and ophthalmology.
Embase subject headings: psoriasis, low back pain, lithium, midwife, urology and ophthalmology.
To allow for reasonable comparisons, we include only articles and reviews and exclude editorials, meeting abstracts and other publication types with less scientific content, since they would naturally be assigned fewer subject headings. From Embase, we include articles, articles in press and reviews. In MEDLINE, there are other publication types, and therefore, we exclude the following publication types: editorial, news, letter, biography, bibliography, dictionary, portrait, personal narrative, newspaper article, comment, autobiography, interview, patient education handout and periodical index. Publications from 1990 to 2019 are included in this study. Including publications before 1990 results in too few publications each year to allow for analyses of any potential development over time.
We match each of the retrieved publications in Embase with the identical publication in MEDLINE. The match is based on the PMID which is assigned to all MEDLINE records, and the PMID is also available for MEDLINE-indexed records in Embase. Using the above-mentioned subject areas to draw a sample from the databases, records from both databases were extracted and the PMIDs matched. Records from Embase and MEDLINE are matched if the PMIDs are an exact match. The data from the matched records are then merged. Consequently, each record then contains information from MEDLINE as well as Embase. The number of subject headings in each database is counted. The data are exported as a CSV file which in this case stores tabular data in plain text. Each line of the file is a data record that contains the same number of fields and the fields are tab-delimited or comma-separated. Data within the field are also separated, and the number of subject headings can thus be counted. The number of subject headings identified in the data was confirmed manually in the two databases for a random sample of 100 records.
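The matching and counting steps described above can be sketched as follows. This is only an illustration under stated assumptions: the column names `PMID` and `Headings`, the `;` separator inside the heading field and the three sample records are all invented for the example and do not reproduce the actual export formats of Embase.com or Web of Science.

```python
import csv
import io

# Hypothetical export layouts: the column names and the ';' separator
# inside the heading field are assumptions, and the records are made up.
MEDLINE_CSV = "PMID,Headings\n101,Psoriasis; Humans; Adult\n102,Lithium; Humans\n"
EMBASE_CSV = "PMID,Headings\n101,psoriasis; human; adult; article\n103,urology; human\n"


def count_headings(field, sep=";"):
    """Count the non-empty heading entries in one delimited field."""
    return sum(1 for part in field.split(sep) if part.strip())


def load(text):
    """Map each PMID to the number of headings in its record."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["PMID"].strip(): count_headings(row["Headings"]) for row in reader}


def match_on_pmid(medline, embase):
    """Keep only records whose PMID occurs in both exports (exact match)."""
    return {
        pmid: {"medline": medline[pmid], "embase": embase[pmid]}
        for pmid in medline.keys() & embase.keys()
    }


matched = match_on_pmid(load(MEDLINE_CSV), load(EMBASE_CSV))
```

Only the record with PMID 101 appears in both toy exports, so it is the only one that survives the match; the merged entry then holds one heading count per database.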
The number of subject headings is analysed within each subject area. Furthermore, we carried out multivariable linear regression analyses to explore the association of the number of subject headings in MEDLINE and Embase, respectively, with five prespecified publication characteristics. These included the following: (1) type of publication (review or article), (2) language of publication (English vs non-English), (3) length of abstract (number of characters in the abstract, referred to below as signs), (4) publication year (1990–2019) and (5) subject area. The analyses are made using Excel (version 16.54) and SPSS (version 28).
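For readers unfamiliar with this type of model, the sketch below fits an ordinary least squares regression with dummy-coded predictors on synthetic data. It is not the authors' SPSS procedure. The slope values echo the MEDLINE estimates reported in the Results, but the intercept, the single subject-area dummy (ophthalmology vs psoriasis) and all data points are assumptions made for the illustration.

```python
from itertools import product


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Assumed coefficients: intercept (invented), review, English,
# abstract length per 100 signs, years since 1990, ophthalmology dummy.
TRUE = [15.0, -3.8, 1.2, 0.2, 0.01, -3.9]

# Synthetic, noise-free records covering every combination of predictor values.
rows = [
    [1.0, review, english, abstract_hundreds, years, ophthalmology]
    for review, english, abstract_hundreds, years, ophthalmology
    in product((0, 1), (0, 1), (5, 15, 25), (0, 10, 29), (0, 1))
]
y = [sum(c * x for c, x in zip(TRUE, row)) for row in rows]

coef = ols(rows, y)
```

Because the synthetic outcome is an exact linear function of the predictors, the fitted coefficients reproduce the assumed values up to floating-point error. Each coefficient is read as the expected change in the number of subject headings when that predictor increases by one unit and the others are held constant.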
4. Results
An overview of the subject areas, the retrieved publications and the resulting matches is available in Table 1.
Table 1. Included subject areas, retrieved publications and number of matches.
The subject headings are sorted in descending order of matches. We notice that the relative overlap between these six subject headings in MEDLINE and Embase varies greatly. This can be caused by differences in the indexed journals and publication types as well as the use of the subject headings when indexing.
We now turn to the results of the analysis of the number of subject headings. Figure 1 provides an overview of the development in the number of assigned subject headings in the total data set, that is, the matched publications from all six subject areas.

Figure 1. Average number of subject headings assigned to articles and reviews in MEDLINE and Embase.
We notice that the average number of assigned subject headings in MEDLINE is stable or even slightly decreasing over time. The average number ranges from a little under 15 to a little over 17. In Embase, the average number of assigned subject headings was stable until about 2000, after which it increased dramatically over the next 3 years. From 1990 to 2000, the average number of assigned subject headings ranged from about 14 to 19, whereas in the period from 2003 onwards, the average number of assigned subject headings ranged from 23 to 25.
However, the average number of subject headings varies across subject areas. Figure 2 shows the analysis of the number of subject headings for each field.

Figure 2. Average number of subject headings assigned to articles and reviews in MEDLINE and Embase for six subject areas.
In Figure 2, we can see that all six subject areas experienced an increase in the average number of subject headings in Embase from 2000 to 2003. We also notice that on average, more subject headings are assigned to articles and reviews in some subject areas than in others. In low back pain and psoriasis, more subject headings are assigned on average than in lithium, ophthalmology, urology and midwifery.
To gain further insight into the number of subject headings assigned in MEDLINE and Embase, two linear regressions were computed. First, a multiple linear regression was calculated to predict the number of subject headings in MEDLINE based on publication characteristics. A significant regression equation was found (F(9, 35647) = 950.449, p < 0.001), with an R² of 0.194. Second, a multiple linear regression was calculated to predict the number of subject headings in Embase based on publication characteristics. A significant regression equation was found (F(9, 35647) = 862.573, p < 0.001), with an R² of 0.179. The relatively low R² of both models should be noted. Values between 0 and 0.3 indicate a weak positive linear relationship through a shaky linear rule [40]. Consequently, the number of subject headings is also correlated with other factors not included in this analysis. An overview of both regressions is available in Table 2.
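As a consistency check on the reported statistics, the F value of a linear regression can be recovered from R² and the degrees of freedom. The relation below is the standard one for ordinary least squares with an intercept; treating the reported models as such, with k = 9 predictors and 35647 residual degrees of freedom, is an assumption based on the F(9, 35647) notation.

```latex
\[
F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}, \qquad k = 9, \quad n - k - 1 = 35647
\]
\[
F_{\text{MEDLINE}} \approx \frac{0.194 / 9}{0.806 / 35647} \approx 953,
\qquad
F_{\text{Embase}} \approx \frac{0.179 / 9}{0.821 / 35647} \approx 864
\]
```

Both values are close to the reported 950.449 and 862.573. The small gaps are what one would expect from using R² rounded to three decimals.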
Table 2. Multiple linear regressions to predict the number of subject headings in MEDLINE and Embase based on publication characteristics.
CI: confidence interval.
The average number of subject headings assigned to a paper in MEDLINE increases slightly over time (by 0.1 over a 10-year period). The average number of subject headings is higher for papers in English (1.2 more subject headings), holding all of the other independent variables constant. The average number of subject headings is lower for reviews than for articles (3.8 fewer subject headings). Papers with longer abstracts are assigned slightly more subject headings (0.2 more subject headings per 100 additional signs). Finally, papers within some subject areas are assigned more subject headings than others (ophthalmology papers are on average assigned 3.9 fewer subject headings than psoriasis papers).
The average number of subject headings assigned to a paper in Embase increases over time (by 2.96 over a 10-year period). The average number of subject headings is much higher for papers in English (6.1 more subject headings). The average number of subject headings is higher for reviews than for articles (2.0 more subject headings). Papers with longer abstracts are assigned slightly more subject headings (0.25 more subject headings per 100 additional signs). Finally, papers within some subject areas are assigned more subject headings than others (urology papers are on average assigned 3.8 fewer subject headings than psoriasis papers).
Summing up, the average number of subject headings in MEDLINE and Embase is higher if the paper is in English, has a longer abstract, is recent and belongs to specific subject areas. However, reviews are assigned more subject headings in Embase and fewer in MEDLINE.
5. Discussion
The results of this study clearly confirm that over time, Embase assigns considerably more subject headings to each article in the six subject areas than does MEDLINE. The difference in the average number of assigned subject headings increased particularly from 2000 to 2003. Furthermore, we find that the average number of subject headings varies across subject areas. The fact that more subject headings are assigned to articles and reviews in Embase may be used in the preparation of systematic searches. Since Embase assigns more subject headings to each article, first investigating the subject headings of a number of relevant articles in Embase may kickstart the exhaustive identification of search terms and may even contribute to a more comprehensive search, since more synonyms may be identified (e.g. via the ‘used for terms’ specified in Embase) and included in the keyword search, which may increase recall. In the existing literature, the higher number of subject headings in Embase has been reported [26,28,29], but it has not been investigated systematically, and consequently, we cannot compare our results to the results of existing studies.
The higher number of subject headings in Embase than in MEDLINE affects recall as well as precision. During the first 10 years of analysis (1990–2000), the difference in the average number of assigned subject headings is very small, but for the last 17 years of analysis, the difference in the average number of subject headings ranges from 7.4 to 9.5. Consequently, Embase assigns on average more than seven additional subject headings per publication during that period. The thesaurus aims to improve the recall and precision of the searches performed [10,11,20]. However, it has been claimed that some of the terms may be only marginally relevant for the specific publication, and thus precision suffers [28]. A large number of subject headings may improve recall but will also increase the number needed to screen [32]. In this study, we are not assessing the relevance of each publication, and therefore, we cannot assess recall and precision. However, all other things being equal, we would expect an increase in recall. Furthermore, the white paper by Elsevier argues that Emtree is easier to use because Emtree is more intuitive and Embase users do not need to look up scope notes to understand terms [25]. Further studies are needed to explore this. This is, of course, of great importance to the users of biomedical databases. Searchers use the thesaurus to improve the recall and precision of their searches, and therefore, we welcome studies that explore the effect of thesaurus use on recall and precision.
The average number of subject headings is correlated with publication characteristics. The average number of subject headings in MEDLINE and Embase is higher if the paper is in English, has a longer abstract, is recent and belongs to specific subject areas. Language restrictions in literature reviews should be avoided due to the risk of language bias, and no language restrictions should be imposed in the search strategy [41]. However, recall can be affected when searching without language restriction, as fewer subject headings are assigned to non-English publications. Furthermore, most of the existing medical document classification studies are carried out with documents in English [42]. Reviews are generally assigned more subject headings in Embase and fewer in MEDLINE, and the consequences for recall are therefore difficult to determine. Finally, papers with very short or no abstracts are less likely to be retrieved using keywords, and as fewer subject headings are assigned to them, the effect on recall increases. Abstract length varies considerably across fields [43–45]. The differences seen in journals are also to some extent caused by different length limits, and positive correlations are found between the abstract length limit and actual abstract length [43]. Although the size of the effect is found to be small in this study, a difference between a publication with an abstract and one without can easily be 1000 signs, and that would imply a difference in assigned subject headings of 2 (MEDLINE) or 2.5 (Embase). More research is needed to explore this correlation. The searcher relies greatly on the subject headings when only a short or no abstract is available, and the correlation between abstract length and number of subject headings should be investigated further. It may be related to the use of machine learning and a rule-based system in indexing. The National Library of Medicine has developed an automated indexing system (Medical Text Indexer or MTI) which processes article titles and abstracts and returns a list of recommended MeSH terms [46]. The performance of these methods is limited by the input, which only includes the title and abstract [47], and the consequences for recall and precision need to be explored further.
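The abstract-length figures mentioned in this discussion follow directly from the regression coefficients reported in the Results (0.2 and 0.25 more subject headings per 100 signs in MEDLINE and Embase, respectively). Assuming the linear relationship holds across a 1000-sign difference, the arithmetic is:

```latex
\[
\Delta_{\text{MEDLINE}} = 0.2 \times \frac{1000}{100} = 2.0,
\qquad
\Delta_{\text{Embase}} = 0.25 \times \frac{1000}{100} = 2.5
\]
```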
Before turning to the conclusions, we need to consider the limitations of this study:
First, we only compare the number of subject headings, not the relevance of each subject heading, when searching for a specific topic. This is also why we cannot analyse the impact on recall and precision. Another aspect is the use of subheadings, but we cannot determine if the difference is caused by fewer or more assigned subject headings, or if the difference is caused by differences in the use of subheadings. This would be very interesting to investigate further but is beyond the scope of this study.
Second, we have selected six areas to investigate; however, they are not necessarily representative of all subject areas and further studies are needed if we are going to make more general conclusions regarding the number of assigned subject headings in Embase and MEDLINE.
Third, the two databases cover different publication types, and therefore, our results cannot be generalised to all database contents. We have only focused on articles and reviews whereas editorials, meeting abstracts, conference abstracts and other publication types covered by these two databases are not included in this study.
6. Conclusion
The results of this study show that over time Embase assigns considerably more subject headings to each article in these six areas than MEDLINE. The difference in the average number of assigned subject headings increased particularly from 2000 to 2003. The clear tendency towards an increase in the number of subject headings is found in all six subject areas; however, the average number of assigned subject headings varies across subject areas. Furthermore, linear regressions show that the average number of subject headings in MEDLINE and Embase is higher for publications in English, publications with longer abstracts, recent publications and publications in specific subject areas. However, reviews are assigned more subject headings in Embase and fewer in MEDLINE.
There are several implications of this study. First, information searchers should be aware of fundamental changes in database indexing policy as they have an impact on recall and precision. The average number of assigned subject headings varies considerably depending on the field, document type, publication year and language. Furthermore, this study has implications for research using subject headings in MEDLINE and Embase. Recent studies have used subject headings for different purposes. MeSH can be valuable when estimating the similarity in topics of a pair of papers [48] when working with author name disambiguation. MeSH can also be used to explore the association of health problems with quality consumer-level information [49]. Studies using subject headings from MEDLINE and Embase need to consider the factors that have an impact on the number of subject headings assigned.
Appendix 1
Searches.
| MEDLINE (Web of Science) | Embase (Embase.com) |
|---|---|
| Psoriasis (MeSH Heading) | ‘Psoriasis’/exp |
| Low back pain (MeSH Heading) | ‘Low back pain’/exp |
| Lithium (MeSH Heading) | ‘Lithium’/exp |
| Midwifery (MeSH Heading) | ‘Midwife’/exp |
| Urology (MeSH Heading) | ‘Urology’/exp |
| Ophthalmology (MeSH Heading) | ‘Ophthalmology’/exp |
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
